Abstract
Before conducting a single cell RNA-seq experiment on a tumor sample, it is important to project how subclone detection power depends on the number of cells sequenced and the coverage per cell. To facilitate experiment design, we developed a tool, DENDROplan (Figure 2), that predicts the expected clustering accuracy by DENDRO given sequencing parameters. Given a tree structure and a target accuracy, DENDROplan computes the necessary read depth and number of cells needed based on the spike-in procedure described above. For more detail, please check our biorixv preprint
Install all packages in the latest version of R.
devtools::install_github("zhouzilu/DENDRO")In addition to DENDRO package, we also needed to download the reference RNA-seq dataset.
If you have any questions or problems when using DENDROplan, please feel free to open a new issue here. You can also email the maintainers of the package – the contact information is below.
Zilu Zhou (zhouzilu at pennmedicine dot upenn dot edu)
Genomics and Computational Biology Graduate Group, UPenn
Nancy R. Zhang (nzh at wharton dot upenn dot edu)
Department of Statistics, UPenn
Figure 2 illustrate the overall pipeline. Figure 2. A flowchart outlining the procedures for DENDROplan.
Our reference dataset is from Deng et al. with great sequencing properties. First, let’s load DENDROplan_ref and check the structure of the variable ref
library(DENDRO)
data("DENDROplan_ref")
str(ref)
#> List of 3
#> $ X1 : int [1:22958, 1:133] 0 0 0 0 0 0 0 0 0 0 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : NULL
#> .. ..$ : chr [1:133] "GSM1112611_earlyblast_2-1_expression.txt" "GSM1112612_earlyblast_2-10_expression.txt" "GSM1112613_earlyblast_2-12_expression.txt" "GSM1112614_earlyblast_2-15_expression.txt" ...
#> $ X2 : int [1:22958, 1:133] 0 0 0 0 0 0 0 0 0 0 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : NULL
#> .. ..$ : chr [1:133] "GSM1112611_earlyblast_2-1_expression.txt" "GSM1112612_earlyblast_2-10_expression.txt" "GSM1112613_earlyblast_2-12_expression.txt" "GSM1112614_earlyblast_2-15_expression.txt" ...
#> $ exp_level: chr [1:22958] "low" "low" "low" "low" ...The trees that DENDROplan is able to simulate show at Figure 3. The corresponding k and subtype value are also included.
Numbers on branches illustrate index of mutation probability (variable lprob) and numbers at node illustrate index of cell proportion (variable kprob). The interested clade is labeled by a star.
Figure 3. Example tree structure DENDRO can simulate with corresponding parameters.
We measure DENDROplan results by four statistics, explained here:
Adjust Rand index: Adjusted Rand index is a measure of the similarity between two data clusterings after adjusted for the chance grouping of elements. For details, see here
Capture rate: Capture rate is a measure of “false negative rate” of a specific clade. Out of all the cells from the specific clade, how many of them is detected by the algorithm.
Purity: Purity is a measure of “false positive rate” of a specific clade. Out of all the cells in the “specific cluster” you detected, how many are actually from the true specific clade.
Observation probability: This is only one single value, which measures the probability of observe all clades in our multipe simulation round.
Assume we only have 100 mutation sites and our tree looks like Figure 3c with mutations and cells uniformly distributed
res=DENDRO.simulation(n=100,ref=ref,k=4,subtype=1)
#> There are total 4 clades in our simulationres
#> AdjustRandIndex Capture rate Purity Observation probability
#> CI_low 0.6300837 0.7692308 0.6923077 NA
#> Mean 0.8517463 0.9454780 0.9534285 1
#> CI_up 1.0000000 1.0000000 1.0000000 NAResult shows the example tree structure as well as statistics with 95% Confidence Interval (CI).
In the 2nd example, we want to specify the similar tree but with customized cell proportion. We want the four clusters have cell proportion 0.2, 0.2, 0.2 and 0.4, i.e. we want one major subclone. In our code we need to specify this by parameter kprob.
res=DENDRO.simulation(kprob=c(0.2,0.2,0.2,0.4),n=100,ref=ref,k=4,subtype=1)
#> There are total 4 clades in our simulationres
#> AdjustRandIndex Capture rate Purity Observation probability
#> CI_low 0.5999641 0.7027027 0.8437500 NA
#> Mean 0.8631731 0.9402432 0.9727299 1
#> CI_up 1.0000000 1.0000000 1.0000000 NAExample tree shows consistency with our input.
One important reason for user to use this tool is that it can estimate the cluster accuracy given different sequencing depth. In the original Deng et al. paper, they collect around 10,000,000 ~50bp reads per cell in mice, resulting 46 mapped reads for each mutation site, which is super high depth (Figure 4).
Figure 4. scRNA-seq library statistics for Deng et al. Modified from original paper.
In real life, usually 1,000,000 reads per cell is pretty good depth. Proportionally, there are around 4.5 mapped reads per mutation site. Let’s change the parameter RD and see how well DENDRO performs.
res=DENDRO.simulation(RD=4.5,n=100,ref=ref,k=4,subtype=1)
#> There are total 4 clades in our simulationres
#> AdjustRandIndex Capture rate Purity Observation probability
#> CI_low 0.5600606 0.7241379 0.7307692 NA
#> Mean 0.8192305 0.9327139 0.9453612 1
#> CI_up 0.9744867 1.0000000 1.0000000 NAFeel free to play with different parameters. The detailed function document can be found by typing ??DENDRO.simulation in R
sessionInfo()
#> R version 3.4.1 (2017-06-30)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 10240)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=English_United States.1252
#> [2] LC_CTYPE=English_United States.1252
#> [3] LC_MONETARY=English_United States.1252
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=English_United States.1252
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] DENDRO_0.1.0 ggplot2_2.2.1 mclust_5.4 phyclust_0.1-22
#> [5] colorspace_1.3-2 svMisc_1.1.0 phangorn_2.4.0 ape_5.0
#> [9] dendextend_1.8.0 TailRank_3.2.1 oompaBase_3.2.6
#>
#> loaded via a namespace (and not attached):
#> [1] modeltools_0.2-21 kernlab_0.9-25 lattice_0.20-35
#> [4] oompaData_3.1.1 htmltools_0.3.6 stats4_3.4.1
#> [7] viridisLite_0.3.0 yaml_2.1.16 rlang_0.2.2
#> [10] prabclus_2.2-6 BiocGenerics_0.24.0 fpc_2.1-11
#> [13] plyr_1.8.4 robustbase_0.92-8 stringr_1.3.1
#> [16] munsell_0.4.3 gtable_0.2.0 mvtnorm_1.0-7
#> [19] evaluate_0.10.1 labeling_0.3 Biobase_2.38.0
#> [22] knitr_1.17 flexmix_2.3-14 parallel_3.4.1
#> [25] class_7.3-14 DEoptimR_1.0-8 trimcluster_0.1-2
#> [28] Rcpp_0.12.18 scales_0.5.0 backports_1.1.0
#> [31] diptest_0.75-7 gridExtra_2.3 fastmatch_1.1-0
#> [34] digest_0.6.12 stringi_1.1.7 grid_3.4.1
#> [37] rprojroot_1.2 quadprog_1.5-5 tools_3.4.1
#> [40] magrittr_1.5 lazyeval_0.2.1 tibble_1.3.4
#> [43] cluster_2.0.6 whisker_0.3-2 pkgconfig_2.0.1
#> [46] MASS_7.3-47 Matrix_1.2-14 rmarkdown_1.8
#> [49] viridis_0.5.0 nnet_7.3-12 igraph_1.1.2
#> [52] nlme_3.1-131 compiler_3.4.1